Work-Stealing-based Persistent Kernel #64

neoblizz · 2026-02-05T20:34:47Z

Motivation

Let persistent workgroups claim tile IDs dynamically at run time (work stealing) instead of assigning each workgroup a fixed partition of tiles up front.
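To illustrate why dynamic tile assignment can help, here is a minimal host-side sketch that compares the finish time of a fixed, strided partition against a schedule where whichever worker is free takes the next tile ID. It is a toy model, not the kernel from this PR: the function names, the per-worker `tile_cost` list, and the strided ownership rule are all assumptions made for illustration.

```python
import heapq


def static_makespan(num_tiles, tile_cost):
    """Fixed partitioning: worker w owns tiles w, w+W, w+2W, ... (assumed layout)."""
    num_workers = len(tile_cost)
    finish_times = []
    for w, cost in enumerate(tile_cost):
        owned = len(range(w, num_tiles, num_workers))
        finish_times.append(owned * cost)
    return max(finish_times)


def dynamic_makespan(num_tiles, tile_cost):
    """Dynamic assignment: the first worker to become free takes the next tile ID."""
    # Heap of (time_when_free, worker_id); the loop index plays the shared counter.
    free_at = [(0.0, w) for w in range(len(tile_cost))]
    heapq.heapify(free_at)
    makespan = 0.0
    for _next_tile in range(num_tiles):
        t, w = heapq.heappop(free_at)
        t += tile_cost[w]
        makespan = max(makespan, t)
        heapq.heappush(free_at, (t, w))
    return makespan


# Hypothetical costs: the last worker takes 4x longer per tile than the others.
costs = [1.0, 1.0, 1.0, 4.0]
print("static :", static_makespan(64, costs))
print("dynamic:", dynamic_makespan(64, costs))
```

With uniform per-tile costs the two schedules finish at the same time in this model; the gap appears only when some workers are slower or receive more expensive tiles.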

Getting Started

git clone -b neoblizz/work-stealing https://github.com/ROCm/tritonBLAS
cd tritonBLAS
pip install -e .

# Install latest triton
git clone https://github.com/triton-lang/triton
cd triton
pip install -e .
# Return to the tritonBLAS root, where benchmarks/ and tools/ live
cd ..

# Work-stealing CU sweep (304 to 32 CUs)
python benchmarks/tritonblas_matmul.py \
    --input-yaml datasets/bench_8k.yaml \
    --work-stealing \
    --cu-sweep \
    --cu-sweep-max-remove 34 \
    --counters-per-xcd 1 \
    --output-csv results_ws_cu_sweep.csv

# torch.matmul CU sweep (baseline)
python benchmarks/torch_matmul.py \
    --input-yaml datasets/bench_8k.yaml \
    --cu-sweep \
    --cu-sweep-max-remove 34 \
    --output-csv results_torch_cu_sweep.csv

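# Plot the sweeps. Note: results_persistent_sweep.csv is not produced by the
# commands above and has to be generated separately.
python tools/plot_cu_sweep.py \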
    --persistent results_persistent_sweep.csv \
    --torch      results_torch_cu_sweep.csv \
    --ws-cpc 1   results_ws_cu_sweep.csv \
    -o cu_sweep_plot.png

Copilot

Pull request overview

This PR introduces a work-stealing-based persistent GEMM kernel that dynamically allocates tile IDs across compute units instead of using fixed partitioning. The implementation uses per-XCD (chiplet) atomic counters to reduce contention compared to global atomic operations. The work-stealing kernel is exposed as an opt-in feature through a new work_stealing parameter in the matmul APIs.

Changes:

- Added MatmulConfig class to pre-allocate and manage GPU buffers for kernel launches (tile counters, stream-K locks/partials)
- Implemented work-stealing kernel with per-XCD atomic tile counters in persistent_gemm_work_stealing.py
- Extended all matmul APIs with optional work_stealing and config parameters to support the new kernel
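The per-XCD counter idea described above can be sketched on the host with plain Python threads. This is only a rough model under stated assumptions, not the Triton kernel in this PR: the `TileCounters` class, the contiguous per-XCD tile ranges, the round-robin `pid % num_xcds` mapping, and the absence of cross-XCD stealing are all hypothetical choices made to keep the example small. A lock stands in for the GPU's atomic fetch-and-add.

```python
import threading


class TileCounters:
    """Host-side model of per-XCD tile counters (hypothetical, for illustration)."""

    def __init__(self, num_tiles, num_xcds):
        # Assumption: tiles are split into contiguous, near-equal ranges per XCD.
        base, extra = divmod(num_tiles, num_xcds)
        self.start, self.end = [], []
        lo = 0
        for x in range(num_xcds):
            hi = lo + base + (1 if x < extra else 0)
            self.start.append(lo)
            self.end.append(hi)
            lo = hi
        self.next = list(self.start)
        # One lock per counter, so workers on different XCDs never contend.
        self.locks = [threading.Lock() for _ in range(num_xcds)]

    def grab(self, xcd):
        """Stand-in for an atomic fetch-and-add on this XCD's counter."""
        with self.locks[xcd]:
            tile = self.next[xcd]
            if tile >= self.end[xcd]:
                return None  # this XCD's range is exhausted
            self.next[xcd] = tile + 1
            return tile


def worker(pid, counters, num_xcds, done):
    xcd = pid % num_xcds  # assumption: round-robin workgroup-to-XCD mapping
    while (tile := counters.grab(xcd)) is not None:
        done[tile] = pid  # placeholder for computing the output tile


def run(num_tiles=64, num_xcds=8, num_workers=32):
    counters = TileCounters(num_tiles, num_xcds)
    done = [None] * num_tiles
    threads = [
        threading.Thread(target=worker, args=(p, counters, num_xcds, done))
        for p in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

Splitting one global counter into several independent ones means each atomic update is contended only by the workers that share that counter, which is the contention reduction the overview refers to.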

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 22 comments.

| File | Description |
| --- | --- |
| `include/tritonblas/matmul.py` | Added MatmulConfig class for buffer management; integrated work_stealing parameter and ws_persistent_matmul kernel; refactored buffer allocation to use config objects |
| `include/tritonblas/kernels/persistent_gemm_work_stealing.py` | New work-stealing kernel implementation with per-XCD atomic counters and dynamic tile assignment |
| `include/tritonblas/kernels/__init__.py` | Exported ws_persistent_matmul kernel |
| `include/tritonblas/__init__.py` | Exported MatmulConfig and matmul_preamble to public API |
| `tests/test_work_stealing.py` | Standalone test with custom module loading to test work-stealing kernel correctness and performance |
| `benchmarks/benchmark_work_stealing.py` | Comprehensive benchmark comparing work-stealing against static persistent, stream-K, and torch.matmul |

neoblizz added 2 commits February 5, 2026 19:46

...

2373b38

8-way spread contention.

ecc5d73

Copilot AI review requested due to automatic review settings February 5, 2026 20:34

Copilot started reviewing on behalf of neoblizz February 5, 2026 20:35

Copilot AI reviewed Feb 5, 2026

neoblizz added 6 commits February 11, 2026 20:46

Better APIs, clean-up, use Origami for ws.

8c1d91b

Fix UNKNOWN pip package issue.

e4908aa

Benchmark and CU sweep plot.

cfbefd8

Add a simple 8K dataset.

a1e792c

CU sweep for torch.

4e5831a

Fix the cu-masking.

f849cbe

ryanswann-amd self-requested a review February 12, 2026 17:29

neoblizz added 3 commits February 12, 2026 18:49

Stride to avoid contention on the same coherence channel.

579cf89

global-counter option.

a0a3de8

...

37c72cd
